Parth Rege - Reinforcement Learning

EN.685.801.21.SP24

Professors Rodriguez, Saeed, Johnson

Contents

  1. Load Data
  2. Generate Features
  3. Q-Learning Algorithm
  4. Deep Q-Learning Algorithm
        4.1 DQN with Policy and Target Network
        4.2 DQN with Policy Network Only
        4.3 DQN with Policy Network Only But No Replay
        4.4 DQN with Policy Network and Regularization
        4.5 DQN with Policy and Target Network and Regularization
  5. Future Considerations

1 (Load Data) Load 3 years of AAPL data pulled from NASDAQ w/ Sentiment

^ Contents

In our previous module submission, we detailed the creation of the sentiment-included dataset for AAPL stock price data. To simplify ETL, I will be using that data as the starting point for this module submission. The process here will be:

  1. Load and prep the AAPL daily sentiment dataset from the earlier module submission.
  2. Filter the daily AAPL data to the year for the RL analysis simulation; I will be using 2023 for this.
  3. Load, filter, and join Fed rate (T-Bill average rate) data by quarter.

Reference data is sourced from: https://fiscaldata.treasury.gov/datasets/average-interest-rates-treasury-securities/average-interest-rates-on-u-s-treasury-securities

Now that we have our sourced features, we will move onto the next section in order to generate features to help feed our eventual RL algorithms.
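The three ETL steps above can be sketched as follows. This is a minimal sketch on synthetic rows with placeholder values; the actual frames come from the earlier module's AAPL sentiment dataset and the Treasury average-rates download, and the column names here are assumptions, not the notebook's exact schema.

```python
import pandas as pd

# Synthetic stand-ins for the real AAPL sentiment and T-Bill rate data
aapl = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-03", "2023-04-03", "2023-07-03"]),
    "close": [100.0, 101.0, 102.0],       # placeholder prices
    "sentiment": [0.1, -0.1, 0.2],        # placeholder daily sentiment scores
})
rates = pd.DataFrame({
    "record_date": pd.to_datetime(["2023-03-31", "2023-06-30", "2023-09-30"]),
    "avg_rate": [4.0, 4.5, 5.0],          # placeholder T-Bill average rates
})

# Step 2: keep only the simulation year
aapl = aapl[aapl["date"].dt.year == 2023]

# Step 3: join the quarterly rate onto each daily row
aapl["quarter"] = aapl["date"].dt.to_period("Q")
rates["quarter"] = rates["record_date"].dt.to_period("Q")
df = aapl.merge(rates[["quarter", "avg_rate"]], on="quarter", how="left")
```

A left merge on the quarter period keeps every daily row and attaches the matching quarter's rate, so later quarters without rate data would simply show NaN rather than dropping days.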

2 (Generate Features) Feature Engineering and LSTM Implementation

^ Contents

We will now create lagged features for close price and sentiment, along with a delta Fed rate column to capture the month-over-month change in the Fed rate (this will help train the RL agent to make policy decisions based on a changing Fed rate).
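A minimal sketch of this feature step, again on a synthetic frame with placeholder values (the real frame carries the 2023 AAPL close, sentiment, and joined Fed rate columns):

```python
import pandas as pd

df = pd.DataFrame({
    "close": [100.0, 101.5, 99.8, 102.3, 103.1],
    "sentiment": [0.1, -0.2, 0.05, 0.3, 0.0],
    "fed_rate": [4.0, 4.0, 4.25, 4.25, 4.5],
})

# Lagged features for close price and sentiment
for col in ["close", "sentiment"]:
    df[f"{col}_lag1"] = df[col].shift(1)

# Change in the Fed rate from the previous observation
df["fed_rate_delta"] = df["fed_rate"].diff()

# Drop the NaN rows introduced by shifting so the agent never trains on NaNs
df = df.dropna().reset_index(drop=True)
```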

Now that we have the base data, let us utilize an LSTM to include a forecasted next-day price for each state as an additional feature.
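The LSTM itself is a standard Keras-style model; the part worth sketching is the windowing that turns the close-price series into the (samples, timesteps, features) tensors an LSTM layer consumes, with each window's next-day close as the target. The lookback of 5 and the synthetic series are illustrative assumptions.

```python
import numpy as np

def make_windows(series, lookback=5):
    """Build (samples, lookback, 1) input windows and next-day targets,
    the shapes an LSTM layer would train on."""
    X = np.array([series[i:i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X[..., np.newaxis], y

closes = np.linspace(100.0, 120.0, 30)   # synthetic close-price series
X, y = make_windows(closes, lookback=5)
# The fitted model's prediction for each window becomes the extra
# "forecasted next-day price" feature attached to each state.
```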

Now let's remove the NaNs introduced above so that the RL agent is never training on a NaN.

Let's also quickly compare these predictions against the real close prices. Note that there will be a bit of a forward shift in the data, since we are deliberately giving each day's close price a peek into the future via tomorrow's predicted close price; the plot will therefore look slightly shifted.

Now that we have sourced, loaded, prepared, and cleaned the data and generated the relevant features, this dataframe can be used to train a Q-Learning and a Deep Q-Learning algorithm.

3 (Q-Learning Algorithm) Implement a Simple Q-Learning Agent and Evaluate Performance

^ Contents

To understand how this is set up, we will do the following:

  1. Action: chosen by random exploration at the beginning of training, then by policy-based exploitation in the later episodes.
  2. Q-table updates use the Bellman equation to maintain a consistently optimized Q-table that helps the RL agent make action decisions.
  3. Reward is calculated one of two ways: (1) a simple method where the reward at each step is the portfolio value given the price the stock was bought at, and (2) a more involved method that adds a side reward of profit or loss depending on the action made yesterday. Note: the simple method was not stable, so I switched to method 2.
  4. The agent is then trained over 50,000 episodes; in each episode it runs through the entire year's worth of states (close price, sentiment, Fed rate, forecasted values, etc.) and makes decisions via its actions.
  5. We then plot the portfolio value and reward value until convergence.
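The core of steps 1 and 2 can be sketched as below. The state/action counts, learning rate, and discount are illustrative assumptions, not the notebook's tuned values; the update function is the standard Q-learning form of the Bellman equation.

```python
import numpy as np
import random

# Hypothetical discretization: 10 state buckets, 3 actions (buy/hold/sell)
n_states, n_actions = 10, 3
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.95   # assumed learning rate and discount factor

def choose_action(state, epsilon):
    """Step 1: random exploration early, greedy exploitation later."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(Q[state]))

def bellman_update(state, action, reward, next_state):
    """Step 2: move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    best_next = np.max(Q[next_state])
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])
```

Across the 50,000 episodes, epsilon would decay from near 1 toward a small floor so the agent shifts from exploring to exploiting the learned table.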

So on average, we are turning our 1000 USD initial investment into 1167.05 USD, and we see the algorithm acting quite stably above, with a high of 1420.69 USD and a low of 976.52 USD.

4 (Deep Q-Learning Algorithm) Implement a Deep Q-Learning (DQN) Agent and Evaluate Performance

^ Contents

Now the DQN builds upon the simple Q-Learning algorithm as follows:

  1. Builds a simple neural network with an input layer, one hidden layer, and an output layer, then copies the weights from the policy network to the target network to synchronize them (cloning). This gives the target network a stable reference for Q-value updates during training.
  2. It then chooses the exploitation version of the action using a prediction from the policy network.
  3. Experience storage and replay updates the corresponding Q-value in the current state based on the calculated target, then trains the policy network on this updated Q-value.
  4. The reward is the same as in the simple Q-Learning algorithm, except that negative rewards are replaced with 0.
  5. We will add regularization and/or dropout to the Q-network to help avoid overfitting, encourage generalization, and improve training stability.

References:

  1. https://github.com/Albert-Z-Guo/Deep-Reinforcement-Stock-Trading/tree/master
  2. https://medium.com/@shruti.dhumne/deep-q-network-dqn-90e1a8799871
  3. https://github.com/conditionWang/DRQN_Stock_Trading
  4. https://medium.com/@murrawang/deep-q-network-and-its-application-in-algorithmic-trading-16440a112e04

Note also that we will be showing a few variations of the DQN where some aspects are either added or removed to see if stability/convergence can be reached through training.

4.1 DQN with Policy and Target Network

^ Contents

Note that I will also show an execution of the NN with only a policy network in Section 4.2.

Generally, the more often we run replay and update the target network, the higher the highs the algorithm reaches on average. We are not seeing true convergence here, although the agent does tend to stay within a range, as shown above.

4.2 DQN with Policy Network Only

^ Contents

With only a policy network, the results look slightly more erratic. Comparing this to the run with both policy and target networks, there is a wider range within which the rewards and portfolio value operate, and my belief is that this is because there is no target network helping with overall stability.

4.3 DQN with Policy Network Only But No Replay

^ Contents

This is frankly stagnant without experience replay, because the neural network is never trained on stored experiences from the replay function. It is definitely more stable, but it does not hit the higher average portfolio value that we saw with experience replay and the double networks (policy and target). The agent also is not converging, although its behavior is less erratic, since random experience replay is not retraining the Q-network. At that point, however, we might as well just use the Q-table method.

4.4 DQN with Policy Network and Regularization

^ Contents

Basically, I read that adding regularization helps control overfitting, leading to a more robust and generalized policy network for the DQN agent. This makes the training process more stable and increases the likelihood of convergence: regularization penalizes larger weights, and dropout (not included here, but in Part 4.5) randomly removes neurons so the remaining weights are forced to work harder and generalize better. Generally a value of 0.2 is a good starting point for both, but a value that is too high can cause adverse effects (too much generalization).
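The two mechanisms can be shown in a few lines of numpy. This is a sketch of the math only; the weight shapes, stand-in gradient, and activations are illustrative, and in the notebook these effects come from the Q-network's layer configuration rather than manual updates.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 3))   # illustrative hidden-layer weights
lam, lr, p_drop = 0.2, 1e-2, 0.2         # the 0.2 starting values noted above

# L2 regularization: the penalty's gradient (lam * W) shrinks large weights
grad = rng.normal(size=W.shape)          # stand-in for the loss gradient
W -= lr * (grad + lam * W)

# Inverted dropout: randomly zero activations, rescale the survivors
h = rng.normal(size=(4, 3))              # stand-in hidden activations
mask = (rng.random(h.shape) >= p_drop) / (1 - p_drop)
h_dropped = h * mask                     # applied only during training
```

Dropout is active only during training; at inference the full activations are used, which the 1/(1 - p) rescaling accounts for.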

References:

  1. https://arxiv.org/pdf/1810.00123
  2. https://medium.com/analytics-vidhya/regularization-understanding-l1-and-l2-regularization-for-deep-learning-a7b9e4a409bf
  3. https://www.e2enetworks.com/blog/regularization-in-deep-learning-l1-l2-dropout
  4. https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/

Okay, this is actually not horrible: we see a generally higher average portfolio and some stability in the curves at the end. Note that there is no dropout and only a small amount of regularization added. My guess is that if we include both, we will see substantial stabilization in the training curve, so let's try that now.

4.5 DQN with Policy and Target Network and Regularization

^ Contents

Okay, frankly I just wanted to test regularization on the larger two-network model to see if it stabilizes as we saw above.

This looks like it is converging well, with a few small up and down ticks. This is likely due to the double networks plus the regularization/dropout added to the Q-network. We also see an average portfolio value of 1329.99 USD, the best average among all the variations.

5 (Future Considerations) Notes On Potential Future Areas For Research

^ Contents

Although we have explored a number of Reinforcement Learning topics, there are still multiple improvements and future directions that could be pursued:

  1. A more complicated network architecture - deeper neural networks with varied activation functions, hyperparameter tuning of regularization and dropout, etc.
  2. Different exploration strategies - i.e. Thompson Sampling
  3. Double DQN - to help reduce overestimation bias in Q-learning by decoupling the action selection from the target value computation.
  4. A more complex reward function to potentially have better portfolio returns.
  5. Testing on multiple or different years of data (so far we have tested only on the 2023 calendar year of AAPL, which was a good year; what happens in a down year like 2022?).
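For item 3, the Double DQN change fits in one line: the policy network selects the next action, but the target network evaluates it. The sketch below reuses linear Q-approximators as stand-ins for the two networks, with illustrative dimensions.

```python
import numpy as np

rng = np.random.default_rng(1)
policy_W = rng.normal(size=(5, 3))   # stand-in for the policy network
target_W = rng.normal(size=(5, 3))   # stand-in for the target network

def double_dqn_target(s_next, reward, gamma=0.99):
    """Select the action with the policy network, evaluate it with the
    target network -- the decoupling that reduces overestimation bias."""
    a_star = int(np.argmax(s_next @ policy_W))        # selection: policy net
    return reward + gamma * (s_next @ target_W)[a_star]  # evaluation: target net
```

Standard DQN instead takes `max` over the target network's own Q-values, which tends to overestimate; splitting selection from evaluation is the whole fix.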